We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient because it uses discrete tokens and requires fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient because it uses parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, and cardinality. Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More results are available at https://muse-model.github.io.
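The masked modeling task described above can be illustrated with a minimal sketch of the input corruption step only. The grid size, vocabulary size, mask ratio, and the helper name `mask_tokens` are assumptions made for illustration; the abstract does not specify Muse's actual masking schedule or tokenizer, and no model is trained here.

```python
import numpy as np

def mask_tokens(tokens, mask_ratio, mask_id, rng):
    """Replace a random subset of discrete image tokens with a mask id.

    Hypothetical helper: returns the corrupted grid and a boolean array
    marking the positions a model would be trained to predict.
    """
    tokens = np.asarray(tokens)
    n = tokens.size
    n_mask = int(round(mask_ratio * n))
    # Choose distinct flat positions to mask.
    idx = rng.choice(n, size=n_mask, replace=False)
    is_masked = np.zeros(n, dtype=bool)
    is_masked[idx] = True
    is_masked = is_masked.reshape(tokens.shape)
    masked = np.where(is_masked, mask_id, tokens)
    return masked, is_masked

rng = np.random.default_rng(0)
vocab_size = 1024                 # assumed codebook size
mask_id = vocab_size              # assumed: one extra id reserved for [MASK]
tokens = rng.integers(0, vocab_size, size=(4, 4))  # assumed 4x4 token grid
masked, is_masked = mask_tokens(tokens, mask_ratio=0.5, mask_id=mask_id, rng=rng)
```

A training loss would then be computed only at positions where `is_masked` is true, conditioned on the text embedding.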
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of the challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based either on multiple identical models (61%) or on heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
Geometric camera calibration is often required for applications that understand the perspective of the image. We propose Perspective Fields as a representation that models the local perspective properties of an image. Perspective Fields contain per-pixel information about the camera view, parameterized as an up vector and a latitude value. This representation has a number of advantages: it makes minimal assumptions about the camera model and is invariant or equivariant to common image editing operations like cropping, warping, and rotation. It is also more interpretable and aligned with human perception. We train a neural network to predict Perspective Fields, and the predicted Perspective Fields can easily be converted to calibration parameters. We demonstrate the robustness of our approach under various scenarios compared with camera calibration-based methods and show example applications in image compositing.
Recent research has demonstrated the capability of behavior signals captured by smartphones and wearables for longitudinal behavior modeling. However, there is a lack of a comprehensive public dataset that serves as an open testbed for fair comparison among algorithms. Moreover, prior studies mainly evaluate algorithms using data from a single population within a short period, without measuring the cross-dataset generalizability of these algorithms. We present the first multi-year passive sensing datasets, containing over 700 user-years and 497 unique users' data collected from mobile and wearable sensors, together with a wide range of well-being metrics. Our datasets can support multiple cross-dataset evaluations of behavior modeling algorithms' generalizability across different users and years. As a starting point, we provide the benchmark results of 18 algorithms on the task of depression detection. Our results indicate that both prior depression detection algorithms and domain generalization techniques show potential but need further research to achieve adequate cross-dataset generalizability. We envision our multi-year datasets can support the ML community in developing generalizable longitudinal behavior modeling algorithms.
Federated learning (FL) enables the building of robust and generalizable AI models by leveraging diverse datasets from multiple collaborators without centralizing the data. We created NVIDIA FLARE as an open-source software development kit (SDK) to make it easier for data scientists to use FL in their research and real-world applications. The SDK includes solutions for state-of-the-art FL algorithms and federated machine learning approaches, which facilitate building workflows for distributed learning across enterprises and enable platform developers to create a secure, privacy-preserving offering for multiparty collaboration utilizing homomorphic encryption or differential privacy. The SDK is a lightweight, flexible, and scalable Python package, and allows researchers to bring their data science workflows implemented in any training libraries (PyTorch, TensorFlow, XGBoost, or even NumPy) and apply them in real-world FL settings. This paper introduces the key design principles of FLARE and illustrates some use cases (e.g., COVID analysis) with customizable FL workflows that implement different privacy-preserving algorithms. Code is available at https://github.com/NVIDIA/NVFlare.
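As a rough illustration of training without centralizing data, the sketch below shows a generic federated averaging step in plain NumPy. It is not NVIDIA FLARE code and uses none of its APIs; the function name `federated_average`, the dictionary layout of the parameters, and the weighting by local sample count are assumptions chosen for illustration.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Combine client model parameters into a global model.

    Hypothetical helper: each client sends only its parameters (a dict of
    arrays) and its local sample count; raw data never leaves the client.
    Parameters are averaged with weights proportional to sample counts.
    """
    total = float(sum(client_sizes))
    keys = client_weights[0].keys()
    return {
        k: sum(w[k] * (n / total) for w, n in zip(client_weights, client_sizes))
        for k in keys
    }

# Two simulated clients with different amounts of local data.
client_a = {"w": np.array([1.0, 2.0])}
client_b = {"w": np.array([3.0, 6.0])}
global_model = federated_average([client_a, client_b], client_sizes=[1, 3])
```

In this toy case the second client holds three times as much data, so its parameters contribute three quarters of the average.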
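Deep image prior (DIP) is a recently proposed technique for solving imaging inverse problems by fitting the reconstructed images to the output of an untrained convolutional neural network. Unlike pretrained feedforward neural networks, the same DIP can generalize to arbitrary inverse problems, from denoising to phase retrieval, while offering competitive performance on each task. The main drawback of DIP is that, while feedforward neural networks can reconstruct an image in a single pass, DIP must gradually update its weights over hundreds to thousands of iterations, at a significant computational cost. In this work, we use meta-learning to massively accelerate DIP-based reconstructions. By learning a proper initialization for the DIP weights, we demonstrate a 10x improvement in runtimes across a range of inverse imaging tasks. Moreover, we demonstrate that a network trained to quickly reconstruct faces also generalizes to reconstructing natural image patches.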
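Recently, deep models have established SOTA performance for low-resolution image inpainting, but they lack fidelity at the resolutions associated with modern cameras, such as 4K or more, and for large holes. We contribute an inpainting benchmark dataset of photos at 4K and above, representative of modern sensors. We demonstrate a novel framework that combines deep learning and traditional methods. We use an existing deep inpainting model, LaMa, to fill the hole plausibly, establish three guide images consisting of structure, segmentation, and depth, and apply a multiply-guided PatchMatch to produce eight candidate upsampled inpainted images. Next, we feed all candidate inpaintings through a novel curation module that chooses a good inpainting by column summation on an 8x8 antisymmetric pairwise preference matrix. Our framework's results are overwhelmingly preferred by users over 8 strong baselines, with improvements in quantitative metrics of up to 7.4 over the best baseline, LaMa. When paired with 4 different SOTA inpainting backbones, our technique improves each of them such that our results are overwhelmingly preferred by users over a strong super-resolution baseline.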
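The column-summation step of the curation module can be sketched as follows. The sign convention is an assumption: here `pref[i, j] > 0` means candidate `j` is preferred over candidate `i`, so a larger column sum marks a more preferred candidate. The function name `select_candidate` and the toy scores are hypothetical; how the paper actually fills the preference matrix is not described in the abstract.

```python
import numpy as np

def select_candidate(pref):
    """Pick the candidate with the largest column sum.

    Assumed convention: pref[i, j] = -pref[j, i], and pref[i, j] > 0 means
    candidate j beats candidate i.
    """
    pref = np.asarray(pref, dtype=float)
    if pref.ndim != 2 or pref.shape[0] != pref.shape[1]:
        raise ValueError("preference matrix must be square")
    if not np.allclose(pref, -pref.T):
        raise ValueError("preference matrix must be antisymmetric")
    column_scores = pref.sum(axis=0)  # total preference received by each candidate
    return int(np.argmax(column_scores))

# Toy 8x8 example: pairwise preferences derived from made-up quality scores.
quality = np.array([0.1, 0.4, 0.3, 0.9, 0.2, 0.5, 0.6, 0.0])
pref = quality[None, :] - quality[:, None]  # pref[i, j] = quality[j] - quality[i]
best = select_candidate(pref)
```

With preferences built this way, the column sum for candidate `j` equals `8 * quality[j] - quality.sum()`, so the selected index matches the highest made-up quality score.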
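The goal of this paper is to describe a system for generating synthetic sequential data within the Synthetic Data Vault. To achieve this, we present the sequential model currently in SDV, an end-to-end framework that builds a generative model for multi-sequence, real-world data. This includes a novel neural network-based machine learning model, the conditional probabilistic auto-regressive (CPAR) model. The overall system and the model are available in the open-source Synthetic Data Vault (SDV) library (https://github.com/sdv-dev/sdv), along with a variety of other models for different synthetic data needs. After building the Sequential SDV, we used it to generate synthetic data and compared its quality against an existing, non-sequential model based on generative adversarial networks. To compare the sequential synthetic data against its real counterpart, we invented a new metric called Multi-Sequence Aggregate Similarity (MSA). We used it to conclude that our Sequential SDV model learns higher-level patterns than non-sequential models without any trade-off in synthetic data quality.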
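Pathological diagnosis relies on the visual inspection of histologically stained thin tissue specimens, where different types of stains are applied to bring contrast to and highlight various desired histological features. However, destructive histochemical staining procedures are usually irreversible, making it very difficult to obtain multiple stains on the same tissue section. Here, we demonstrate a virtual stain transfer framework via a cascaded deep neural network (C-DNN) to digitally transform hematoxylin and eosin (H&E) stained tissue images into other types of histological stains. Unlike a single neural network structure, which takes only one stain type as input to digitally output images of another stain type, C-DNN first uses virtual staining to transform autofluorescence microscopy images into H&E and then performs stain transfer from H&E to the domain of the other stain in a cascaded manner. This cascaded structure in the training phase allows the model to directly exploit histochemically stained image data of both H&E and the target special stain. This advantage alleviates the challenge of paired data acquisition and improves the image quality and color accuracy of the virtual stain transfer from H&E to another stain. We validated the superior performance of this C-DNN approach using kidney needle core biopsy tissue sections and successfully transferred the H&E-stained tissue images into a virtual PAS (periodic acid-Schiff) stain. This method provides high-quality virtual images of special stains using existing, histochemically stained slides and creates new opportunities in digital pathology by performing highly accurate stain-to-stain transformations.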
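Deep learning-based virtual staining was developed to introduce image contrast to label-free tissue sections, digitally matching histological staining, which is time-consuming, labor-intensive, and destructive to tissue. Standard virtual staining requires high autofocusing precision during the whole-slide imaging of label-free tissue, which consumes a significant portion of the total imaging time and can lead to tissue photodamage. Here, we introduce a fast virtual staining framework that can stain defocused autofluorescence images of unlabeled tissue, achieving performance equivalent to virtual staining of in-focus label-free images while also saving significant imaging time by lowering the microscope's autofocusing precision. This framework incorporates a virtual-autofocusing neural network to digitally refocus the defocused images and then transforms the refocused images into virtually stained images using a successive network. These cascaded networks form a collaborative inference scheme: the virtual staining model regularizes the virtual-autofocusing network through a style loss during training. To demonstrate the efficacy of this framework, we trained and blindly tested these networks using human lung tissue. Using 4x fewer focus points with lower focusing precision, we successfully transformed the coarsely focused autofluorescence images into high-quality virtual H&E images, matching the standard virtual staining framework that used finely focused autofluorescence input images. Without sacrificing staining quality, this framework decreases the total image acquisition time needed for virtual staining of a label-free whole-slide image (WSI) by ~32%, together with a ~89% decrease in the autofocusing time, and has the potential to eliminate the laborious and costly histochemical staining process in pathology.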